Method, device, and storage medium for lesion segmentation and RECIST diameter prediction via click-driven attention and dual-path connection

ABSTRACT

The present disclosure provides a method, a device, and a storage medium for a prior-guided dual-path network (PDNet). The method includes inputting, by an image encoder, an image into a split-attention network to extract a feature map at each scale and compressing the feature map to form a compressed feature map of each scale; inputting the compressed feature map and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting the attention enhanced feature map to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale, in combination with up-sampled feature maps and/or down-sampled feature maps from other scales, to form a concatenated feature map of the current scale; and attaching a deconvolutional layer to a scale-aware attention (SA) module of a highest-level scale to segment a lesion and predict a RECIST diameter based on concatenated feature maps.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the priority of U.S. Provisional Patent Application Nos. 63/174,821 and 63/174,826, both filed on Apr. 14, 2021, the entire contents of which are incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure generally relates to the field of lesion measuring technology and, more particularly, relates to a method, a device, and a storage medium for lesion segmentation and RECIST diameter prediction via click-driven attention and dual-path connection.

BACKGROUND

Assessing lesion growth across multiple time points is a major task for radiologists and oncologists. The sizes of lesions are important clinical indicators for monitoring disease progression and therapy response in oncology. A widely-used guideline is RECIST (response evaluation criteria in solid tumors), which requires users to first select an axial slice where the lesion has the largest spatial extent, then measure the longest diameter of the lesion (long axis), followed by its longest perpendicular diameter (short axis). Such a process is highly tedious and time-consuming; and more importantly, it is prone to inconsistency between different observers, even those with considerable clinical knowledge. Segmentation masks may be another quantitative and meaningful metric to assess lesion sizes, which is arguably more accurate and/or precise than RECIST diameters and avoids the subjectivity of selecting long and short axes. However, it is impractical for radiologists to manually delineate the contour of every target lesion on a daily basis due to the heavy workload required.

Deep learning-based computer-aided diagnosis techniques, including automatic lesion segmentation, have been extensively studied by researchers. Most existing works focus on tumors of specific types, such as lung nodules, liver tumors, and lymph nodes. However, radiologists often encounter different types of lesions when reading images. Universal lesion segmentation and measurement have drawn attention in recent years, aiming at learning from a large-scale dataset to handle a variety of lesions in one method. These works leverage the NIH DeepLesion dataset, which contains the RECIST annotations of over 30K lesions of various types. In one existing approach, users are required to draw a box around the lesion to indicate the lesion of interest; the approach first employs a spatial transform network to normalize the lesion region, then adapts a stacked hourglass network to regress the four endpoints of the RECIST diameters. In another existing approach, users are required to only click a point on or near the lesion, which is more convenient and efficient than some existing approaches; an improved Mask R-CNN is used to detect the lesion region, and segmentation and RECIST diameter prediction are subsequently performed. In that approach, user click information is fed into the model as input together with the image. Such a strategy treats lesions with diverse sizes and shapes in the same way, and thus may not be optimal for locating the lesion region precisely.

BRIEF SUMMARY OF THE DISCLOSURE

One aspect or embodiment of the present disclosure provides a prior-guided dual-path network (PDNet) method. The method includes inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, where the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and scale-aware attention (SA) module is configured to adaptively select a scale or feature map for a lesion; and further includes attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on the concatenated feature maps.

Optionally, the split-attention network includes a number of blocks, each block outputting a feature map of one scale of the multiple scales; and a convolutional layer is used to compress the feature map of each scale along the channel dimension to form the compressed feature map of each scale.

Optionally, the three-channel image includes an original image, a click image, and a distance transform image.

Optionally, the prior encoder includes a number of atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer.

Optionally, 6 and 3 side outputs are added in the decoder to introduce deep mask supervision and deep diameter supervision, respectively.

Optionally, the dual-path connection includes top-down connection corresponding to the one or more down-sampled feature maps, and bottom-up connection corresponding to the one or more up-sampled feature maps.

Optionally, the SA module is configured to adaptively select the features from a corresponding concatenated feature map along a corresponding channel dimension for the lesion.

Optionally, the method includes a first stage and a second stage, where the first stage is configured to extract a lesion of interest, and the second stage is configured to obtain lesion segmentation and RECIST diameter prediction from the extracted lesion of interest.

Another aspect or embodiment of the present disclosure provides a prior-guided dual-path network (PDNet) device. The device includes a memory, containing a computer program stored thereon; and a processor, coupled with the memory and configured, when the computer program is executed, to perform a method including: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, where the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and scale-aware attention (SA) module is configured to adaptively select a scale or feature map for a lesion; and further includes attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on the concatenated feature maps.

Optionally, the split-attention network includes a number of blocks, each block outputting a feature map of one scale of the multiple scales; and a convolutional layer is used to compress the feature map of each scale along the channel dimension to form the compressed feature map of each scale.

Optionally, the three-channel image includes an original image, a click image, and a distance transform image.

Optionally, the prior encoder includes a number of atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer.

Optionally, 6 and 3 side outputs are added in the decoder to introduce deep mask supervision and deep diameter supervision, respectively.

Optionally, the dual-path connection includes top-down connection corresponding to the one or more down-sampled feature maps, and bottom-up connection corresponding to the one or more up-sampled feature maps.

Optionally, the SA module is configured to adaptively select the features from a corresponding concatenated feature map along a corresponding channel dimension for the lesion.

Optionally, the method includes a first stage and a second stage, where the first stage is configured to extract a lesion of interest, and the second stage is configured to obtain lesion segmentation and RECIST diameter prediction from the extracted lesion of interest.

Another aspect or embodiment of the present disclosure provides a storage medium storing program instructions configured to be executable by a computer to cause the computer to implement operations including: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along the channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, where the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and scale-aware attention (SA) module is configured to adaptively select a scale or feature map for a lesion; and further includes attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on the concatenated feature maps.

Other aspects or embodiments of the present disclosure can be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings are merely examples for illustrative purposes according to various disclosed embodiments and are not intended to limit the scope of the present disclosure.

FIG. 1 depicts an exemplary configuration diagram of a prior-guided dual-path network (PDNet) according to various disclosed embodiments of the present disclosure;

FIG. 2 depicts a flowchart illustrating a prior-guided dual-path network (PDNet) method according to various disclosed embodiments of the present disclosure; and

FIG. 3 depicts exemplary visual results on a DeepLesion test set and an external test set according to various disclosed embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference may be made in detail to exemplary embodiments of the disclosure, which are illustrated in the accompanying drawings. Wherever possible, same reference numbers may be used throughout the drawings to refer to same or like parts.

While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. Also, the words “comprise”, “have”, “contain”, “include”, and other similar forms are intended to be equivalent in meaning and be interpreted as open ended, such that an item or items following any one of these words is not meant to be an exhaustive listing of the item or items, or meant to be limited to only the listed item or items. And the singular forms “a”, “an” and “the” are intended to include plural references, unless the context clearly dictates otherwise.

The present disclosure provides a framework named prior-guided dual-path network (PDNet). Given a 2D computed tomography (CT) slice and a click guidance in a lesion region, the objective may be to segment the lesion and predict its RECIST diameters automatically and reliably. To achieve the above-mentioned objective, a two-stage framework may be adopted. At the first stage, the lesion of interest (LOI) may be extracted by segmentation rather than detection, since detection results sometimes do not cover the clicked lesions. At the second stage, the lesion segmentation and RECIST diameter prediction results may be obtained from the extracted LOI. A prior encoder may be designed to encode the click prior information into attention maps, which can deal with considerable size and shape variations of the lesions. Furthermore, a scale-aware attention block with dual-path connection may be designed to improve the decoder. In various embodiments, the decoder disclosed herein may be a hardware decoder or a hardware/software decoder. The PDNet may be evaluated on manually-labeled lesion masks and RECIST diameters in a DeepLesion dataset.

Various embodiments of the present disclosure provide a method, a device, and a storage medium for lesion segmentation and RECIST diameter prediction via click-driven attention and dual-path connection. The lesion segmentation and RECIST diameter prediction via click-driven attention and dual-path connection are described in detail hereinafter.

FIG. 1 depicts an exemplary configuration diagram of the PDNet according to various disclosed embodiments of the present disclosure. The PDNet may include two stages, where the first stage may extract the LOI by segmentation, and the second stage may perform lesion segmentation and RECIST diameter prediction from the extracted LOI. Referring to FIG. 1, the PDNet may include three components including an image encoder, a prior encoder (PE) with click-driven attention, and a decoder with dual-path connection having multiple scale-aware attention modules (SAs).

According to various embodiments of the present disclosure, the PDNet may include the image encoder. The image encoder may extract highly discriminative feature maps (e.g., each feature map including corresponding features) from an input CT image. Also, representing feature maps at multiple scales may be of great importance for performing tasks in various embodiments of the present disclosure. It should be noted that, in the existing technology, split-attention blocks are stacked in ResNet style to create a split-attention network, named ResNeSt. ResNeSt may be able to capture cross-channel feature map correlations by combining the channel-wise attention with multi-path network layout, and may universally improve the learned feature map representations to boost performance across various vision tasks. Therefore, ResNeSt-50 may be used as backbone to extract highly discriminative multiple scale feature maps in the image encoder. Referring to FIG. 1, ResNeSt-50 may have five blocks which output multiple scale feature maps with different channels; and in order to relieve the computation burden, the multiple scale feature maps may be compressed to 32 channels using a convolutional layer with 32 3×3 kernels. In one embodiment, the input of a block n may be the output feature maps of a block n−1, where n=2,3,4,5.

According to various embodiments of the present disclosure, the PDNet may further include the prior encoder with click-driven attention. Given a click guidance, a click image and a distance transform image may be generated and considered as prior information, as shown in FIG. 1. In the existing technology, the prior information is integrated into the model by directly treating it as input for feature extraction. The representation ability of the feature maps extracted from the image encoder may be weakened by such a strategy, because the sizes and shapes of different lesions are highly diverse while the prior information generated using such a strategy is the same. To avoid the above-mentioned problem, the prior encoder (PE) with click-driven attention may be separately built, which is able to learn lesion-specific attention matrices by effectively exploring the click prior information. With lesion-specific attention matrices, the representation ability of the multiple scale feature maps extracted from the image encoder may be enhanced to improve the task performance. Referring to FIG. 1, the prior encoder may be inputted with the compressed multiple scale feature maps and a 3-channel image (the original CT image, the click image, and the distance transform image), and may output attention enhanced multiple scale feature maps. The prior encoder may include a number of (e.g., 5) atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer, for example, with 32 3×3 kernels and a stride of 2. For the detailed structure of the ASPP based attention module, reference may be made to FIG. 1, where 5 side outputs (the solid tilted arrows) may be added to introduce the deep mask supervision to learn the attention matrices.
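As an illustration of the 3-channel input described above, the following minimal sketch builds a click image and a distance transform image from one click and stacks them with the CT slice. The exact form of both prior images is not specified herein; the binary single-pixel click image, the Euclidean distance normalized to [0, 1], and the function names `make_prior_channels` and `stack_three_channel` are assumptions for illustration only.

```python
import math


def make_prior_channels(height, width, click_row, click_col):
    """Build a click image and a distance transform image for one click.

    Assumptions (not specified in the text): the click image is a binary map
    with a single 1 at the clicked pixel, and the distance transform is the
    Euclidean distance to the click, scaled to [0, 1] by the largest distance.
    """
    click = [[0.0] * width for _ in range(height)]
    click[click_row][click_col] = 1.0

    dist = [[math.hypot(r - click_row, c - click_col) for c in range(width)]
            for r in range(height)]
    max_dist = max(max(row) for row in dist)
    if max_dist > 0:
        dist = [[d / max_dist for d in row] for row in dist]
    return click, dist


def stack_three_channel(ct_slice, click_row, click_col):
    """Stack [CT, click image, distance transform] as a 3 x H x W nested list."""
    height, width = len(ct_slice), len(ct_slice[0])
    click, dist = make_prior_channels(height, width, click_row, click_col)
    return [ct_slice, click, dist]
```

For example, `stack_three_channel(ct, 1, 2)` on a 4×5 slice may return three 4×5 channels, with the click channel equal to 1 only at row 1, column 2.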

According to various embodiments of the present disclosure, the PDNet may further include the decoder with dual-path connection having multiple SAs. It should be noted that the low-level scale feature maps focus on fine-grained lesion parts (e.g., edges) but are short of global contextual information, while the high-level scale feature maps are capable of segmenting the entire lesion regions coarsely but at the cost of losing certain detailed information. Unlike UNet, where the decoder only considers current scale feature maps and their neighboring high-level scale feature maps gradually, a decoder that can aggregate the attention enhanced multiple scale feature maps more comprehensively may be provided in various embodiments of the present disclosure. The feature maps of each scale may interact with all lower-level and higher-level scale feature maps in the decoder, which is accomplished by using dual-path connection including top-down connection and bottom-up connection. The top-down connection (T2D) may adopt a bilinear interpolation operation on the high-level scale feature maps for up-sampling followed by a convolutional layer with 32 3×3 kernels for smoothing. The bottom-up connection (B2U) may perform a convolution operation with 32 3×3 kernels and a large stride for down-sampling. Then the current scale feature map may be concatenated with all up-sampled and down-sampled feature maps from other scales in a channel dimension, so that each concatenated feature map can represent both the global contextual information and the local detail information of the lesion. The concatenated feature maps may be directly used for lesion segmentation or RECIST diameter prediction using a convolutional layer with 1 or 4 3×3 kernels, respectively.
However, before the direct usage of the concatenated feature maps, to further improve the feature map representations, the SA may be built based on the channel attention mechanism of DANet, which selectively emphasizes interdependent channel feature maps by integrating associated feature maps among all feature map channels. The SA's structure is shown in FIG. 1. Different lesions have different scales, but the SA may be able to adaptively select suitable scales or channel feature maps for different lesions for better accuracy. To obtain a full-size prediction, a deconvolutional layer with 32 4×4 kernels and a stride of 2 may be attached to the highest-level scale SA (e.g., the uppermost or last SA in FIG. 1). In one embodiment, as shown in FIG. 1, only the concatenated feature map from the highest-level scale SA may be considered as the input of the deconvolutional layer; and there may be no deconvolutional layer for the concatenated feature maps of other scales. Furthermore, 6 and 3 side outputs may be added in the decoder to introduce the deep mask supervision (the solid tilted arrows) and deep diameter supervision (the dotted tilted arrows), respectively. The deep diameter supervision may be only used for high-resolution side outputs, because a high-quality RECIST diameter prediction requires large spatial and detailed information.
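The dual-path concatenation described above may be illustrated by a small bookkeeping sketch that lists, for each scale, which other scales would be up-sampled or down-sampled before concatenation, and how many channels the concatenated feature map would have. The indexing of scales from 1 (finest resolution) to 5 (coarsest resolution) and the function name `dual_path_plan` are assumptions for illustration; the count of 32 channels per feature map follows the 32-kernel convolutional layers described above.

```python
def dual_path_plan(num_scales=5, channels_per_scale=32):
    """For each scale, list which other scales are up-sampled or down-sampled.

    Assumption: scales are indexed 1 (finest resolution) to num_scales
    (coarsest resolution). Coarser maps must be up-sampled to reach a finer
    scale, and finer maps must be down-sampled to reach a coarser one.
    """
    plan = {}
    for current in range(1, num_scales + 1):
        upsampled = [s for s in range(1, num_scales + 1) if s > current]
        downsampled = [s for s in range(1, num_scales + 1) if s < current]
        plan[current] = {
            "upsampled_from": upsampled,
            "downsampled_from": downsampled,
            # Current map plus one map from every other scale, all with the
            # same number of channels, concatenated along the channel axis.
            "concat_channels": channels_per_scale
            * (1 + len(upsampled) + len(downsampled)),
        }
    return plan
```

Under these assumptions, every scale may receive four maps from the other scales, so each concatenated feature map may have 5 × 32 = 160 channels before the SA selects among them.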

According to various embodiments of the present disclosure, for model optimization, the RECIST diameter prediction problem may be converted into a key point regression problem. That is, the model may predict four key point heatmaps to locate the four endpoints of the RECIST diameters. For both tasks, a mean squared error loss ($l_{mse}$) may be used to compute the errors between predictions and supervisions. A pixel-wise loss may be affected by imbalanced foreground and background pixels, and lesion and non-lesion regions may be highly imbalanced at stage 1 of the PDNet. To solve such a problem, an additional IOU loss ($l_{iou}$) may be introduced for the lesion segmentation task, which handles the global structures of lesions instead of every single pixel. As described above, 11 side outputs with deep mask supervision (5 in the prior encoder and 6 in the decoder) and 3 side outputs with deep diameter supervision may be used in the PDNet. Therefore, the lesion segmentation loss is expressed as:

$$l_{seg} = \sum_{i=1}^{11} \left[ l_{mse}^{i} + l_{iou}^{i} \right]$$

and the RECIST diameter prediction loss is expressed as:

$$l_{dp} = \sum_{i=1}^{3} l_{mse}^{i}$$

The final loss is:

$$l = \lambda l_{seg} + (1 - \lambda) l_{dp}$$

where λ may be set to 0.01 to balance the magnitudes of the two losses. The two PDNet models used in the above-mentioned two stages may be trained separately.
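The combination of the side-output losses may be sketched as follows on flattened pixel lists. The IOU loss is not defined herein, so the soft IoU formulation below is an assumption, as are the function names and the premise that every side output has already been resized to the resolution of its supervision.

```python
def mse_loss(pred, target):
    """Mean squared error over flattened pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def soft_iou_loss(pred, target, eps=1e-6):
    """1 - soft IoU. One common formulation, assumed here for illustration."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def pdnet_loss(mask_preds, mask_target, heatmap_preds, heatmap_target, lam=0.01):
    """Combine side-output losses as lam * l_seg + (1 - lam) * l_dp.

    mask_preds holds one flattened prediction per mask side output (11 in the
    text), heatmap_preds one per diameter side output (3 in the text).
    """
    l_seg = sum(mse_loss(p, mask_target) + soft_iou_loss(p, mask_target)
                for p in mask_preds)
    l_dp = sum(mse_loss(p, heatmap_target) for p in heatmap_preds)
    return lam * l_seg + (1.0 - lam) * l_dp
```

With λ = 0.01, the summed segmentation term over 11 side outputs is scaled down heavily, so that it may not dominate the diameter term summed over only 3 side outputs.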

For the segmentation task, manual lesion masks in DeepLesion may not be available. Therefore, an ellipse may be first constructed from an existing RECIST annotation. Then the morphological snake (MS) approach may be used to refine the ellipse to get a pseudo mask with desirable quality, serving as the mask supervision. For the $l_{dp}$ update, four 2D Gaussian heatmaps with a standard deviation of σ may be generated from the four endpoints of each RECIST annotation, serving as the diameter supervision. σ=3 may be set at stage 1, and σ=7 may be set at stage 2. Furthermore, an iterative refining strategy may be applied in various embodiments of the present disclosure. When the training is completed, the model may be run over all training data to get their lesion segmentation results, and then the MS approach may be used to refine the above-mentioned results. With an ellipse and a refined segmentation result, the pseudo mask may be updated by setting their intersection as the foreground, their differences as uncertain regions that may be ignored for loss computation during training, and the rest as the background. The new pseudo masks may be used to retrain the models; and final models may be obtained after three training iterations.
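The pseudo mask update and the heatmap supervision described above may be sketched as follows. The label value used for uncertain pixels, the unnormalized Gaussian with a peak of 1.0, and the function names are assumptions for illustration; the ellipse construction and the MS refinement are not shown.

```python
import math

FOREGROUND, BACKGROUND, IGNORE = 1, 0, 255  # IGNORE value is an arbitrary choice


def update_pseudo_mask(ellipse_mask, refined_mask):
    """Merge the RECIST ellipse and the refined segmentation, pixel by pixel."""
    merged = []
    for e_row, r_row in zip(ellipse_mask, refined_mask):
        row = []
        for e, r in zip(e_row, r_row):
            if e and r:
                row.append(FOREGROUND)   # intersection: lesion
            elif e or r:
                row.append(IGNORE)       # difference: uncertain, skipped in the loss
            else:
                row.append(BACKGROUND)   # neither mask: background
        merged.append(row)
    return merged


def gaussian_heatmap(height, width, center_row, center_col, sigma):
    """2D Gaussian heatmap centered on one RECIST endpoint (peak value 1.0)."""
    return [[math.exp(-((r - center_row) ** 2 + (c - center_col) ** 2)
                      / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]
```

One heatmap may be generated per endpoint, giving four heatmaps per RECIST annotation, with `sigma=3` at stage 1 and `sigma=7` at stage 2 as stated above.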

FIG. 2 depicts a flowchart illustrating a PDNet method according to various disclosed embodiments of the present disclosure.

In S202, a CT (computed tomography) image may be inputted into a split-attention network to extract a feature map at each scale of multiple scales, each scale corresponding to one channel, and the feature map of each scale along the channel dimension may be compressed to form a compressed feature map of each scale, by an image encoder. In one embodiment, referring to FIG. 1, the split-attention network may include a number of blocks, such as block 1, block 2, block 3, block 4, and block 5, respectively; and each block may output a feature map of one scale of the multiple scales. A convolutional layer with 32 3×3 kernels may be used to compress the feature map of each scale along the channel dimension to form the compressed feature map of each scale. In one embodiment, each scale may have different numbers of channels, and all channels in a same scale may have a same dimension/size (e.g., a same height).

In S204, the compressed feature map of each scale and a three-channel image may be inputted into a prior encoder to generate an attention enhanced feature map of each scale, and the attention enhanced feature map of each scale may be outputted, by the prior encoder, to a decoder. The three-channel image may include an original image, a click image, and a distance transform image. The prior encoder may include a number of atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer. 5 side outputs may be added to introduce the deep mask supervision to learn the attention matrices.

In S206, an attention enhanced feature map at a current scale of the multiple scales may be concatenated, by the decoder, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales to form a concatenated feature map of the current scale. The one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and a scale-aware attention (SA) module is configured to adaptively select features from a corresponding concatenated feature map, e.g., along a channel dimension, for a lesion. 6 and 3 side outputs may be added in the decoder to introduce deep mask supervision and deep diameter supervision, respectively. For example, for the highest-level scale SA (e.g., the uppermost or last SA in FIG. 1), the attention enhanced feature map at the highest-level scale may be concatenated, by the decoder, in combination with 4 up-sampled feature maps by up-sampling the concatenated feature maps outputted from other 4 SAs to form the concatenated feature map of the highest-level scale. For another example, for the lowest-level scale SA (e.g., the lowermost or first SA in FIG. 1), the attention enhanced feature map at the lowest-level scale may be concatenated, by the decoder, in combination with 4 down-sampled feature maps by down-sampling the attention enhanced feature maps at 4 other scales to form the concatenated feature map of the lowest-level scale.

In S208, a deconvolutional layer may be attached to the highest-level scale SA of multiple SAs to segment the lesion and predict a RECIST diameter of the lesion based on the concatenated feature maps. For example, a deconvolutional layer with 32 4×4 kernels and a stride of 2 may be attached to the highest-level scale SA (e.g., the uppermost or last SA in FIG. 1) to make full-size prediction.
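The effect of the deconvolutional layer on spatial size may be checked with the standard output-size relation for a transposed convolution (dilation of 1). The kernel size of 4 and the stride of 2 are taken from the description above; the padding of 1 and the function name are assumptions for illustration, since no padding is specified herein.

```python
def transposed_conv_output_size(size_in, kernel=4, stride=2, padding=1,
                                output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding
```

Under the assumed padding, a 256×256 feature map may be mapped to 512×512, which is consistent with a stride-2 layer producing a full-size prediction from a half-resolution feature map.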

According to various embodiments of the present disclosure, datasets and evaluation criteria may be described in detail hereinafter. In an implementation manner of the present disclosure, the DeepLesion dataset may contain 32,735 CT lesion images with RECIST diameter annotations from 10,594 studies of 4,459 patients. Various lesions throughout the whole body may be included, such as lung nodules, bone lesions, liver tumors, enlarged lymph nodes, and the like. 1000 lesion images from 500 patients with manual segmentations may serve as a test set; and the remaining patient data may be used for training. An external test set with 1350 lesions from 900 patients may be built for external validation by collecting lung, liver, pancreas, kidney tumors, and lymph nodes from multiple public datasets, including Decathlon-Lung (50), LIDC (200), Decathlon-HepaticVessel (200), Decathlon-Pancreas (200), KiTS (150), and NIH-Lymph Node (100), where each lesion has a 3D mask. In order to be suitable for evaluation, an axial slice may be selected for each lesion where the lesion has the largest spatial extent based on its 3D mask. The long and short diameters calculated from the 2D lesion mask of the selected slice may be treated as the ground truths of the RECIST diameters. The pixel-wise precision, recall, and dice coefficient (Dice or DICE) may be used for lesion segmentation. The mean and standard deviation of differences between the diameter lengths (mm) of the predictions and manual annotations may be used for RECIST diameter prediction.
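The evaluation criteria above may be sketched as follows for a single lesion. The use of the absolute difference, the isotropic pixel spacing, and the function names are assumptions for illustration; the mean and standard deviation would then be taken over all lesions of a test set.

```python
import math


def segmentation_metrics(pred_mask, true_mask):
    """Pixel-wise precision, recall, and Dice for two binary masks."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred_mask, true_mask):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, dice


def diameter_length_mm(endpoints, pixel_spacing_mm):
    """Length of one diameter given its two (row, col) endpoints."""
    (r1, c1), (r2, c2) = endpoints
    return math.hypot(r2 - r1, c2 - c1) * pixel_spacing_mm


def diameter_error_mm(pred_endpoints, true_endpoints, pixel_spacing_mm):
    """Absolute difference between predicted and annotated lengths, in mm."""
    return abs(diameter_length_mm(pred_endpoints, pixel_spacing_mm)
               - diameter_length_mm(true_endpoints, pixel_spacing_mm))
```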

In an implementation manner of the present disclosure, the PDNet may be implemented in PyTorch; and the image encoder may be initialized with ImageNet pre-trained weights. At both stages, the PDNet may be trained using the Adam optimizer for 120 epochs with an initial learning rate of 0.001, which may be decayed by a factor of 0.1 after 60 and 90 epochs. During training, all CT images may be first resized to 512×512. Then the input images may be generated by randomly rotating by θ∈[−10°, 10°] and cropping a square sub-image whose size is s∈[480, 512] at stage 1, or 1.5 to 3.5 times as large as the lesion's long side with random offsets at stage 2. The cropped images may be resized to 512×512 at stage 1 and 256×256 at stage 2. For testing, input images may be generated by resizing to 512×512 at stage 1, or by cropping a square sub-image whose size is 2.5 times the long side of the lesion segmentation result produced by the first PDNet model at stage 2. To mimic the clicking behavior of a radiologist, a point may be randomly selected from a region obtained by eroding the ellipse to half of its size.
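The step learning-rate schedule and the stage-2 test-time crop described above may be sketched as follows. Treating epochs as zero-indexed with the decay applied from epochs 60 and 90 onward, taking the long side from the bounding box of the stage-1 segmentation result, centering the crop on that box, and the function names are all assumptions for illustration; clipping the crop to the image bounds is not shown.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(60, 90), gamma=0.1):
    """Step schedule: multiply the rate by gamma once each milestone is reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def stage2_test_crop(min_row, min_col, max_row, max_col, scale=2.5):
    """Square crop centered on the stage-1 lesion box, side = scale x long side.

    Returns (top, left, bottom, right) in pixel coordinates, which may extend
    beyond the image and would then need padding or clipping.
    """
    long_side = max(max_row - min_row, max_col - min_col)
    half = scale * long_side / 2.0
    center_row = (min_row + max_row) / 2.0
    center_col = (min_col + max_col) / 2.0
    return (center_row - half, center_col - half,
            center_row + half, center_col + half)
```

For a lesion box spanning rows 100 to 140 and columns 120 to 200, the long side is 80 pixels, so the crop may be a 200×200 square centered at row 120, column 160.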

As a powerful segmentation framework, nnUNet built based on UNets has been successfully used in various medical image segmentation tasks, thus it can serve as a strong baseline in various embodiments of the present disclosure. An nnUNet model may be trained for each stage by taking as input the 3-channel image and using the same setting as PDNet. At stage 1, nnUNet may produce poor segmentation performance, for example, the DICE score may be about 0.857 on the DeepLesion test set, suggesting that the LOI extracted is not suitable enough to serve as the input of stage 2. Therefore, the 2nd nnUNet may take as input the LOIs extracted by the 1st PDNet, achieving a DICE score of 0.911.

FIG. 3 depicts exemplary visual results on a DeepLesion test set and an external test set according to various disclosed embodiments of the present disclosure. Five visual examples of the results produced by the nnUNet and the PDNet may be shown in FIG. 3. The 1st PDNet can segment the lesion regions (FIG. 3(c)), even if the lesions are small (the 4th row), heterogeneous (the 2nd row), or have blurry boundaries (the 5th row), irregular shapes (the 3rd row) and the like, which indicates that the LOIs can be extracted reliably at stage 1 (FIG. 3(b)). In addition, the lesion segmentation results may be improved significantly by the 2nd PDNet (FIG. 3(e)), but a part of the results may become worse when using the 2nd nnUNet (e.g., the 1st and 3rd rows in FIG. 3(d)). Furthermore, the RECIST diameters predicted by the PDNet may be significantly closer to the references than those predicted by the nnUNet. The qualitative results may validate that the PDNet method (e.g., framework) provided in various embodiments of the present disclosure can segment the lesions and predict their RECIST diameters reliably using only a click guidance. An external test set from 6 public lesion datasets of 5 organs may be additionally collected to demonstrate the generalizability of the PDNet method provided in various embodiments of the present disclosure.

In one embodiment of the present disclosure, visual examples of results on the DeepLesion test set (the first three rows) and the external test set (the last two rows) may be shown in FIG. 3, where gray and white curves/crosses are the manual annotations and automatic results, respectively. Referring to FIG. 3, given a CT image (FIG. 3 (a)) and a click guidance, the 1st PDNet may produce an initial lesion segmentation result (FIG. 3 (c)) at stage 1, based on which a LOI (FIG. 3 (b)) may be extracted and taken as input of stage 2; and the final results of lesion segmentation (left) and RECIST diameter prediction (right) may be obtained by the 2nd nnUNet (FIG. 3 (d)) and the 2nd PDNet (FIG. 3 (e)).
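The hand-off from stage 1 to stage 2 may be illustrated with a minimal sketch. It assumes that the LOI is a bounding box around the stage-1 segmentation mask, enlarged by a fixed margin; the function name `extract_loi`, the `margin` parameter, and the bounding-box rule itself are hypothetical and are not specified in this section.

```python
from typing import List, Tuple

Grid = List[List[float]]


def extract_loi(image: Grid, mask: List[List[int]], margin: int = 2
                ) -> Tuple[Grid, Tuple[int, int, int, int]]:
    """Crop `image` to the bounding box of the nonzero pixels of `mask`.

    Assumption: the box is padded by `margin` pixels on every side and
    clipped to the image borders. Returns the crop and the inclusive
    box (top, left, bottom, right) in original image coordinates.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        raise ValueError("stage-1 mask is empty; no LOI can be extracted")

    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, height - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)

    crop = [row[left:right + 1] for row in image[top:bottom + 1]]
    return crop, (top, left, bottom, right)
```

The returned box allows the stage-2 outputs to be mapped back to the coordinates of the original CT image.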

Table 1 outlines quantitative results of lesion segmentation and RECIST diameter prediction using different methods on two test sets. The mean and standard deviation of all metrics are reported. It can be seen that: 1) the PDNet may obtain a relatively high Dice score and relatively small diameter errors on the DeepLesion test set, which indicates that the PDNet can simultaneously segment the lesions accurately and produce reliable RECIST diameters close to the radiologists' manual annotations; 2) compared to the strong baseline nnUNet, the PDNet may obtain more desirable results on both test sets, because the PDNet is able to extract more comprehensive multiple scale feature maps to better represent the appearances of different kinds of lesions; and 3) compared to the DeepLesion test set, the performance may drop for both the nnUNet and the PDNet on the external test set; for example, the Dice score of the PDNet decreases from 0.924 to 0.885. In the external test set, some lesion masks may not be well annotated, thus the generated ground-truth RECIST diameters may also be affected. In addition, the segmentation results produced by the PDNet may be better aligned to the lesion boundaries visually.

TABLE 1

| Test set | Method | Segmentation precision | Segmentation recall | Segmentation Dice | RECIST long axis error | RECIST short axis error |
|---|---|---|---|---|---|---|
| DeepLesion test set | nnUNet | 0.977 ± 0.033 | 0.852 ± 0.086 | 0.907 ± 0.050 | 2.108 ± 1.997 | 1.839 ± 1.733 |
| DeepLesion test set | PDNet | 0.961 ± 0.044 | 0.898 ± 0.077 | 0.924 ± 0.045 | 1.733 ± 1.470 | 1.524 ± 1.374 |
| External test set | nnUNet | 0.946 ± 0.062 | 0.815 ± 0.099 | 0.870 ± 0.054 | 2.334 ± 1.906 | 1.985 ± 1.644 |
| External test set | PDNet | 0.927 ± 0.074 | 0.857 ± 0.093 | 0.885 ± 0.049 | 2.174 ± 1.437 | 1.829 ± 1.339 |
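For reference, the segmentation metrics reported in Table 1 can be computed from a predicted mask and a reference mask as in the following minimal sketch. The function names are hypothetical, and the use of the population standard deviation in `summarize` is an assumption, since the disclosure does not state which estimator was used.

```python
import statistics
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def segmentation_metrics(pred: Mask, ref: Mask) -> Tuple[float, float, float]:
    """Return (precision, recall, Dice) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1      # lesion pixel found by the model
            elif p:
                fp += 1      # predicted lesion, reference background
            elif r:
                fn += 1      # missed lesion pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Dice = 2|P ∩ R| / (|P| + |R|); two empty masks are treated as a match.
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0
    return precision, recall, dice


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and (assumed) population standard deviation over test lesions."""
    return statistics.mean(values), statistics.pstdev(values)
```

Applying `segmentation_metrics` per lesion and `summarize` per metric yields entries in the "mean ± standard deviation" form used in the tables.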

Table 2 outlines the category-wise results in terms of segmentation Dice and the prediction error of diameter lengths on the external test set. The PDNet may achieve better performance in terms of all metrics and categories except the RECIST diameter prediction on lung and kidney tumors. A possible reason may be that a part of the lung and kidney tumors have highly irregular shapes, whose diameters generated from manual masks are highly likely to be larger, and the nnUNet may tend to predict larger diameters than the PDNet in these cases. These results may demonstrate the effectiveness and robustness of the PDNet method provided in various embodiments of the present disclosure.

TABLE 2

| Metric | Method | Lung | Liver | Pancreas | Kidney | Lymph node |
|---|---|---|---|---|---|---|
| Lesion segmentation (Dice) | nnUNet | 0.853 ± 0.054 | 0.876 ± 0.057 | 0.877 ± 0.055 | 0.890 ± 0.057 | 0.865 ± 0.050 |
| Lesion segmentation (Dice) | PDNet | 0.876 ± 0.046 | 0.893 ± 0.051 | 0.886 ± 0.050 | 0.911 ± 0.050 | 0.876 ± 0.045 |
| RECIST diameter prediction (long axis) | nnUNet | 2.396 ± 2.004 | 2.862 ± 2.090 | 2.655 ± 2.048 | 2.493 ± 1.963 | 1.958 ± 1.639 |
| RECIST diameter prediction (long axis) | PDNet | 2.435 ± 1.461 | 2.378 ± 1.463 | 2.220 ± 1.536 | 2.603 ± 1.533 | 1.897 ± 1.293 |
| RECIST diameter prediction (short axis) | nnUNet | 2.223 ± 1.404 | 2.383 ± 1.808 | 2.242 ± 1.637 | 2.342 ± 1.854 | 1.712 ± 1.440 |
| RECIST diameter prediction (short axis) | PDNet | 2.243 ± 1.333 | 2.168 ± 1.405 | 1.977 ± 1.359 | 2.362 ± 1.488 | 1.486 ± 1.174 |
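Because the reference diameters on the external test set are generated from manual masks, it may help to see one way such diameters could be derived. The following is a minimal brute-force sketch, not the procedure used in the disclosure: the function names, the use of pixel centers as end points, and the angular tolerance `tol_deg` are all assumptions. It takes the longest chord between boundary pixels as the long axis and the longest chord within `tol_deg` of perpendicular as the short axis.

```python
import math
from itertools import combinations
from typing import List, Tuple

Mask = List[List[int]]
Point = Tuple[int, int]


def boundary_pixels(mask: Mask) -> List[Point]:
    """Foreground pixels with at least one 4-connected background neighbor."""
    height, width = len(mask), len(mask[0])
    points = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(not (0 <= nr < height and 0 <= nc < width) or not mask[nr][nc]
                   for nr, nc in neighbors):
                points.append((r, c))
    return points


def recist_diameters(mask: Mask, tol_deg: float = 5.0) -> Tuple[float, float]:
    """Return (long axis, short axis) lengths in pixels for a binary mask."""
    points = boundary_pixels(mask)
    if len(points) < 2:
        return 0.0, 0.0

    # Long axis: the longest chord between any two boundary pixels.
    a, b = max(combinations(points, 2), key=lambda pair: math.dist(*pair))
    long_len = math.dist(a, b)
    ur, uc = (b[0] - a[0]) / long_len, (b[1] - a[1]) / long_len

    # A chord is within tol_deg of perpendicular when |cos(angle)| <= sin(tol).
    max_cos = math.sin(math.radians(tol_deg))
    short_len = 0.0
    for p, q in combinations(points, 2):
        length = math.dist(p, q)
        cos_angle = abs((q[0] - p[0]) * ur + (q[1] - p[1]) * uc) / length
        if cos_angle <= max_cos:
            short_len = max(short_len, length)
    return long_len, short_len
```

This sketch does not check that a chord stays inside the lesion, so for highly irregular (non-convex) masks it can return larger diameters than a radiologist would measure, which is in line with the observation above about lung and kidney tumors.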

For ablation studies provided in various embodiments of the present disclosure, to investigate the contributions of the PDNet components, including the prior encoder (PE), the top-down connection (T2D), the bottom-up connection (B2U), and the scale-aware attention module (SA), different models may be configured by sequentially adding these components to a base model that includes the image encoder, with the 3-channel image as input, and a UNet-style decoder. Table 3 outlines quantitative results of different settings of the PDNet method in terms of Dice and the prediction error of diameter lengths on the DeepLesion test set. It can be seen that: 1) each added component may improve the performance at both stages, demonstrating that the above-mentioned strategies contribute to learning more comprehensive feature maps for the tasks in various embodiments of the present disclosure; and 2) the largest improvement may be obtained by introducing the PE, especially for stage 1, demonstrating that the PE can effectively exploit the click prior information to learn lesion-specific attention matrices, which substantially enhance the extracted multiple scale feature maps and thereby improve performance.

TABLE 3

| Setting | PE | T2D | B2U | SA | Stage 1 Dice | Stage 2 Dice | Stage 2 long axis | Stage 2 short axis |
|---|---|---|---|---|---|---|---|---|
| Base model | | | | | 0.871 ± 0.123 | 0.909 ± 0.068 | 1.961 ± 2.278 | 1.704 ± 1.948 |
| | ✓ | | | | 0.890 ± 0.089 | 0.915 ± 0.055 | 1.861 ± 1.934 | 1.617 ± 1.684 |
| | ✓ | ✓ | | | 0.900 ± 0.067 | 0.919 ± 0.054 | 1.809 ± 1.731 | 1.577 ± 1.508 |
| | ✓ | ✓ | ✓ | | 0.905 ± 0.070 | 0.921 ± 0.050 | 1.758 ± 1.696 | 1.544 ± 1.470 |
| | ✓ | ✓ | ✓ | ✓ | 0.911 ± 0.060 | 0.924 ± 0.045 | 1.733 ± 1.470 | 1.524 ± 1.374 |

According to various embodiments of the present disclosure, the deep neural network PDNet method (e.g., framework) may be designed for accurate and automatic lesion segmentation and RECIST diameter prediction, where the PDNet method may work in a two-stage manner. It should be noted that the PDNet method may outperform the existing technology, including the strong baseline nnUNet, on two test sets for both the lesion segmentation and RECIST diameter prediction tasks. Given extremely simple human guidance information, the LOI may be extracted precisely by segmentation at stage 1, and its segmentation and RECIST diameters may be predicted accurately at stage 2. In this way, the PDNet method may offer a useful tool for radiologists to obtain reliable lesion size measurements, including segmentation and RECIST diameters, with greatly reduced time and labor, and may potentially provide high clinical value.

According to various embodiments of the present disclosure, the click guidance from radiologists may be the only requirement. There are two key characteristics in the PDNet method: 1) learning lesion-specific attention matrices in parallel from the click prior information by the prior encoder, named click-driven attention; and 2) aggregating the extracted multiple scale feature maps comprehensively by introducing top-down and bottom-up connections in the decoder (e.g., dual-path connection). The PDNet method may demonstrate superiority in lesion segmentation and RECIST diameter prediction using the DeepLesion dataset and an external test set. The PDNet method may learn comprehensive and representative deep image feature maps for the tasks and produce more accurate results on both lesion segmentation and RECIST diameter prediction.
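The click guidance enters the network through a three-channel image made up of an original image, a click image, and a distance transform image. The following minimal sketch shows one plausible construction. The function name, the single-pixel binary click image, and the normalized Euclidean distance to the click are assumptions; the exact encodings are not specified in this section.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def build_three_channel_input(image: Grid, click: Tuple[int, int]) -> List[Grid]:
    """Stack [original image, click image, distance transform image].

    Assumptions: the click image is 1.0 at the clicked pixel and 0.0
    elsewhere; the distance image holds the Euclidean distance of every
    pixel to the click, divided by the largest such distance in the image.
    """
    height, width = len(image), len(image[0])
    click_row, click_col = click
    if not (0 <= click_row < height and 0 <= click_col < width):
        raise ValueError("click lies outside the image")

    click_image = [[1.0 if (r, c) == (click_row, click_col) else 0.0
                    for c in range(width)] for r in range(height)]

    # The farthest pixel from the click is always one of the four corners.
    max_dist = max(math.hypot(r - click_row, c - click_col)
                   for r in (0, height - 1) for c in (0, width - 1)) or 1.0
    distance_image = [[math.hypot(r - click_row, c - click_col) / max_dist
                       for c in range(width)] for r in range(height)]

    return [image, click_image, distance_image]
```

The distance channel spreads the click information over the whole image, so that pixels far from the click still carry a usable prior signal.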

The present disclosure also provides a prior-guided dual-path network (PDNet) device. The device includes a memory, containing a computer program stored thereon; and a processor, coupled with the memory and configured, when the computer program is executed, to perform a method including: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, where the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection, and a scale-aware attention (SA) module is configured to adaptively select a scale or feature map for a lesion; and attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a RECIST diameter of the lesion based on the concatenated feature maps.

The present disclosure also provides a storage medium storing program instructions configured to be executable by a computer to cause the computer to implement operations including: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, where the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection, and a scale-aware attention (SA) module is configured to adaptively select a scale or feature map for a lesion; and attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a RECIST diameter of the lesion based on the concatenated feature maps. According to various embodiments of the present disclosure, a computer program product may include a non-transitory computer-readable storage medium and program instructions stored therein.

While the disclosure has been illustrated with respect to one or more implementations, alterations and/or modifications can be made to the illustrated examples without departing from the spirit and scope of the appended claims. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular function. Furthermore, to the extent that the terms “include”, “contain”, “have”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising”. The term “at least one of” is used to mean one or more of the listed items can be selected.

Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the disclosure are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Moreover, all ranges disclosed herein are to be understood to encompass any and all sub-ranges subsumed therein. In certain cases, the numerical values as stated for the parameter can take on negative values.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims. 

What is claimed is:
 1. A prior-guided dual-path network (PDNet) method for medical images, the method comprising: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, wherein: the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and a scale-aware attention (SA) module is configured to adaptively select features for a lesion; and attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on concatenated feature maps.
 2. The method according to claim 1, wherein: the split-attention network includes a number of blocks, each block outputting a feature map of one scale of the multiple scales; and a convolutional layer is used to compress the feature map of each scale along the channel dimension to form the compressed feature map of each scale.
 3. The method according to claim 1, wherein: the three-channel image includes an original image, a click image, and a distance transform image.
 4. The method according to claim 1, wherein: the prior encoder includes a number of atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer.
 5. The method according to claim 1, wherein: 6 and 3 side outputs are added in the decoder to introduce deep mask supervision and deep diameter supervision, respectively.
 6. The method according to claim 1, wherein: the dual-path connection includes top-down connection corresponding to the one or more down-sampled feature maps, and bottom-up connection corresponding to the one or more up-sampled feature maps.
 7. The method according to claim 1, wherein: the SA module is configured to adaptively select the features from a corresponding concatenated feature map along a corresponding channel dimension for the lesion.
 8. The method according to claim 1, wherein: the method includes a first stage and a second stage, wherein the first stage is configured to extract lesion of interest, and the second stage is configured to obtain lesion segmentation and RECIST diameter prediction from the extracted lesion of interest.
 9. A prior-guided dual-path network (PDNet) device for medical images, comprising: a memory, containing a computer program stored thereon; and a processor, coupled with the memory and configured, when the computer program is executed, to perform a method comprising: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, wherein: the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and a scale-aware attention (SA) module is configured to adaptively select features for a lesion; and attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on concatenated feature maps.
 10. The device according to claim 9, wherein: the split-attention network includes a number of blocks, each block outputting a feature map of one scale of the multiple scales; and a convolutional layer is used to compress the feature map of each scale along the channel dimension to form the compressed feature map of each scale.
 11. The device according to claim 9, wherein: the three-channel image includes an original image, a click image, and a distance transform image.
 12. The device according to claim 9, wherein: the prior encoder includes a number of atrous spatial pyramid pooling (ASPP) based attention modules and a convolutional layer.
 13. The device according to claim 9, wherein: 6 and 3 side outputs are added in the decoder to introduce deep mask supervision and deep diameter supervision, respectively.
 14. The device according to claim 9, wherein: the dual-path connection includes top-down connection corresponding to the one or more down-sampled feature maps, and bottom-up connection corresponding to the one or more up-sampled feature maps.
 15. The device according to claim 9, wherein: the SA module is configured to adaptively select the features from a corresponding concatenated feature map along a corresponding channel dimension for the lesion.
 16. The device according to claim 9, wherein: the method includes a first stage and a second stage, wherein the first stage is configured to extract lesion of interest, and the second stage is configured to obtain lesion segmentation and RECIST diameter prediction from the extracted lesion of interest.
 17. A storage medium storing program instructions configured to be executable by a computer to cause the computer to implement operations comprising: inputting an image into a split-attention network to extract a feature map at each scale of multiple scales and compressing the feature map of each scale along a channel dimension to form a compressed feature map of each scale, by an image encoder; inputting the compressed feature map of each scale and a three-channel image into a prior encoder to generate an attention enhanced feature map of each scale, and outputting, by the prior encoder, the attention enhanced feature map of each scale to a decoder; concatenating, by the decoder, an attention enhanced feature map at a current scale of the multiple scales, in combination with one or more up-sampled feature maps and/or one or more down-sampled feature maps from other scales of the multiple scales, to form a concatenated feature map of the current scale, wherein: the one or more up-sampled feature maps and the one or more down-sampled feature maps are obtained using dual-path connection; and a scale-aware attention (SA) module is configured to adaptively select features for a lesion; and attaching a deconvolutional layer to a highest-level scale SA of multiple SAs to segment the lesion and predict a response evaluation criteria in solid tumors (RECIST) diameter of the lesion based on concatenated feature maps. 